Data Analysis

Week 8

Jafet Belmont & Jenn Gaskell

School of Mathematics and Statistics

ILOs

By the end of this session you will be able to:

  • Fit and interpret the output of a logistic regression with grouped binary data.

  • Fit and interpret the coefficients of a nominal logistic model.

Assessment - Overview

Release (week)  Topics     Assessment       Weight (%)  Due (week)
3               Weeks 1-3  Quiz 1           15          4
3               Research   Peer assessment  20          9 (end of submission stage)
                                                        10 (end of assessment stage)
6               Weeks 4-6  Quiz 2           15          8
9               Weeks 1-8  Class test       25          9
4               Weeks 1-8  Group project    25          11

Upcoming assessments

Peer assessment

  • Research a randomly assigned topic and record a presentation of up to 15 minutes

    • Submission Due on week(s):

      Week 9 (Mar 10) 11:00 am BST.

      No late submissions are allowed.

    • Review Due on week(s)

      Week 10 (Mar 21) 11:00 am BST

Upcoming assessments

Class test (Next week)

  • Conduct a data analysis and write a short report in Quarto using the techniques covered in the course

    • Timed: The test will begin at 11:30 am in BO 418 (duration: 1.5 hours).

    • Instructions for the class test will become available after today’s lab. Please read them carefully before the class test.

    • The specific tasks for the class test and the corresponding files will become available on the day of the test.

Upcoming assessments

Group project

  • Using collaborative coding, carry out data analysis on a given dataset and then write a report.

    • Group Allocations will be available online after the class.

    • Report submission due on week(s): 11

    • Contribution evaluations and Declaration of Originality

Overview of today’s session

Required R packages

Before we proceed, load all the packages needed for this week:

library(tidyr)
library(ggplot2) 
library(moderndive) 
library(sjPlot)
library(tidymodels) 
library(broom) 
library(performance) 
library(faraway) 
library(palmerpenguins)

First example - Turtles sex determination

This week we will continue reviewing logistic regression to model grouped binary outcomes (e.g. the number of successes out of a fixed number of trials).

The turtle data set within the faraway library (also available on Moodle) contains the number of hatched male and female turtles across different temperatures, with 3 independent replicates for each temperature.

turtles = faraway::turtle 

turtles = turtles %>% 
  mutate(totals = male+female,
         male_props = male/totals)

Logistic regression for grouped data in R

\[ \begin{align}y_i &\sim \mathrm{Binomial}(n_i,p_i)\\\mathrm{logit}(p_i) &= \beta_0 +\beta_1 \times \mathrm{temperature}.\end{align} \]

model_turtles <- glm(male_props ~ temp,
                     data = turtles,
                     weights = totals,   # group sizes for the binomial trials
                     family = binomial)

Interpreting the log-odds

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    -61.32      12.02      -5.10     0.00    -86.83     -39.73
temp             2.21       0.43       5.13     0.00      1.44       3.13
  • For every unit increase in Temperature, the log-odds of a male being hatched increase by 2.21.

  • For every unit increase in Temperature, the odds of hatching a male are \(\mathrm{exp}(\beta_1) = 9.13\) times the odds of those with one temperature unit less.

  • If an egg is incubated at a temperature of 27.5 degrees, what are the chances (odds) of a female being hatched, i.e. \(\mathrm{Odds}(male=0|temp = 27.5)\)?

  • What is the probability of a turtle egg that is incubated in a temperature of 28.5 degrees to become a male?
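As a sketch of how to answer these two questions, we can plug the rounded estimates from the table above into the logit model (exact values will differ slightly from those based on the unrounded coefficients):

```r
# Rounded estimates from the summary table (beta0 = -61.32, beta1 = 2.21)
beta0 <- -61.32
beta1 <- 2.21

# Odds of a male at 27.5 degrees; the odds of a female are the reciprocal
eta_275 <- beta0 + beta1 * 27.5
odds_female_275 <- 1 / exp(eta_275)
round(odds_female_275, 2)   # about 1.72

# Probability of a male at 28.5 degrees via the inverse logit
eta_285 <- beta0 + beta1 * 28.5
p_male_285 <- exp(eta_285) / (1 + exp(eta_285))
round(p_male_285, 2)        # about 0.84
```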

Modelling grouped binary data with a categorical covariate

To illustrate how the previous model works with categorical predictors, we can discretise the temperature values into arbitrary categories as follows:

\[ \mathrm{temperature~ category} =\begin{cases} \text{high} & \mathrm{temperature} > 29^\circ C \\ \text{medium} & 28^\circ C < \mathrm{temperature} \le 29^\circ C \\ \text{low} & \text{otherwise}\end{cases} \]

Code
turtles = turtles  %>% mutate(
  temp_fct = case_when(
    temp > 29 ~ "high",
    temp > 28 ~ "medium",
    .default = "low"
  ) %>% as.factor()
) 

Model fitting

  • Changing the baseline (reference) category to low.
  • The Model

\[ \begin{align} y_i &\sim \mathrm{Binomial}(n_i,p_i)\\ \mathrm{logit}(p_i) &= \alpha + \beta_{1} \times \mathbb{I}_{\mathrm{temperature}}(\mathrm{high}) + \beta_{2} \times \mathbb{I}_{\mathrm{temperature}}(\mathrm{medium}). \end{align} \]

  • \(\alpha\) represents the log-odds of a male turtle being hatched at a low incubation temperature.

  • \(\beta_1\) is the change in the log-odds of a male turtle being hatched given it was incubated at a high temperature compared to a low one.

  • \(\mathbb{I}_{\mathrm{temperature}}(\mathrm{high})\) is an indicator variable that takes the value 1 if the \(i\)th experimental replicate was conducted at a high temperature, and 0 otherwise.

  • \(\beta_2\) is the change in the log-odds of a male turtle being hatched given it was incubated at a medium temperature compared to a low one.

  • \(\mathbb{I}_{\mathrm{temperature}}(\mathrm{medium})\) is an indicator variable that takes the value 1 if the \(i\)th experimental replicate was conducted at a medium temperature, and 0 otherwise.
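A sketch of how this model can be fitted, mirroring the earlier glm() call; relevel() makes low the baseline, and the chunk rebuilds the data so it runs on its own:

```r
library(dplyr)
library(faraway)

turtles <- faraway::turtle %>%
  mutate(totals = male + female,
         male_props = male / totals,
         temp_fct = case_when(temp > 29 ~ "high",
                              temp > 28 ~ "medium",
                              .default = "low") %>%
           factor() %>%
           relevel(ref = "low"))   # "low" as the reference category

model_turtles_cat <- glm(male_props ~ temp_fct,
                         data = turtles,
                         weights = totals,   # group sizes for each replicate
                         family = binomial)

exp(coef(model_turtles_cat))   # estimates on the odds scale
```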

Model fitting

The model estimates on the odds scale with 95% confidence intervals

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)         0.59       0.29      -1.80     0.07      0.33       1.04
temp_fcthigh       45.47       1.06       3.61     0.00      8.59     843.97
temp_fctmedium      6.32       0.44       4.23     0.00      2.76      15.31
  • The odds of a male being hatched if it was incubated at a low temperature are 0.59 times the odds of a female being hatched under the same condition.

  • In other words, the odds of a female being hatched at a low temperature are \(\mathrm{exp}(\alpha)^{-1} =\) 1.68 times the odds of a male being hatched at a low temperature.

  • However, there is not enough evidence that this difference is statistically significant, since the confidence interval (0.33, 1.04) contains 1 (remember we are on the odds scale).

Model fitting

The model estimates on the odds scale with 95% confidence intervals

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)         0.59       0.29      -1.80     0.07      0.33       1.04
temp_fcthigh       45.47       1.06       3.61     0.00      8.59     843.97
temp_fctmedium      6.32       0.44       4.23     0.00      2.76      15.31
  • The odds of a male being hatched are 45.47 times higher in a high temperature setting compared to a low one, a statistically significant difference.

  • The odds of a male being hatched are 6.32 times higher in a medium temperature condition compared to a low one.

  • What if we want to compare the odds of a male being hatched if the egg was incubated at a high temperature condition against a medium one?
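One quick route, sketched below: because both coefficients share the same baseline, the high-vs-medium odds ratio is the ratio of the two reported odds-scale estimates (equivalently, exp of the difference of the log-odds coefficients). Refitting after `relevel(temp_fct, ref = "medium")` would give the same point estimate together with a confidence interval.

```r
# High vs medium comparison from the odds-scale estimates in the table:
# exp(beta_high - beta_medium) = exp(beta_high) / exp(beta_medium)
odds_high <- 45.47
odds_medium <- 6.32
round(odds_high / odds_medium, 2)   # about 7.19
```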

Models for multiple categorical responses

We will look at logistic regression models applied to nominal (unordered) responses with more than two categories.

  • The basis for modelling categorical data with more than two categories is the multinomial distribution for a given random variable \(Y\) with \(J\) categories:

\[ f(\mathbf{y} | n)=\frac{n!}{y_1! y_2! \dots y_J!}p_1^{y_1}p_2^{y_2}\dots p_J^{y_J}. \qquad(1)\]

  • \(p_1,p_2,\dots, p_J\) are the probabilities associated with each of the \(J\) categories, with \(\sum_{j=1}^J p_j = 1\).

  • Suppose we have \(n\) independent observations, of which:

    \(y_1\) outcomes in category 1

    \(y_2\) outcomes in category 2

    \(\vdots\)

    \(y_J\) outcomes in category J.

    \(\Rightarrow \sum_{j=1}^J y_j = n\) and \(\mathbf{y}=(y_1,y_2,\dots,y_J)^\intercal \sim \mathrm{Multinom}(n;p_1,\ldots,p_J)\)
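A quick numerical check of Equation 1 using base R's dmultinom(); the counts and probabilities below are arbitrary illustrative values:

```r
# n = 10 observations spread over J = 3 categories
p <- c(0.2, 0.3, 0.5)   # category probabilities, sum to 1
y <- c(2, 3, 5)         # observed counts, sum to n = 10

# Built-in multinomial pmf ...
dmultinom(y, size = 10, prob = p)                # 0.08505
# ... agrees with the formula n!/(y1! y2! y3!) * p1^y1 * p2^y2 * p3^y3
factorial(10) / prod(factorial(y)) * prod(p^y)   # 0.08505
```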

Nominal logistic regression

  • The goal is to estimate the probabilities for each category based on some explanatory variables \(\mathbf{x}\).

  • The probability of the \(j\)th category is then given by:

\[ P(Y=j|\mathbf{x})= \dfrac{\mathrm{exp}(\mathbf{x}^\intercal \boldsymbol{\beta}_j)}{\sum_{k=1}^J\mathrm{exp}(\mathbf{x}^\intercal \boldsymbol{\beta}_k)} \]

  • As with logistic regression, one category is arbitrarily chosen as the reference category, and all other categories are compared with it.

  • If category \(J\) is chosen as the reference category, the log-odds for the other categories relative to the reference are:

\[ \mathrm{log}\left(\dfrac{P(Y=j|\mathbf{x})}{P(Y=J|\mathbf{x})}\right) = \mathrm{log}\left(\dfrac{p_j}{p_J}\right) =\mathbf{x}^\intercal\boldsymbol{\beta}_j, ~~\text{for } j=1,\dots,J-1. \qquad(2)\]

Nominal logistic regression

We can easily derive the estimated probabilities for each class:

  • For the reference class \(J:\)\[ \hat{p}_J = \dfrac{1}{1+\sum_{j=1}^{J-1} \mathrm{exp}(\mathbf{x}^\intercal\hat{\boldsymbol{\beta}}_j)} \]

  • For class \(j = 1,2,\ldots,J-1:\) \[\hat{p}_j=\dfrac{\exp(\mathbf{x}^\intercal\hat{\boldsymbol{\beta}}_j)}{1+\sum_{k=1}^{J-1}\exp(\mathbf{x}^\intercal\hat{\boldsymbol{\beta}}_k)}. \qquad(3)\]
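These two formulas can be turned into a small helper; multinom_probs() and the coefficient values below are hypothetical, purely to illustrate Equation 3:

```r
# Recover class probabilities from nominal-logistic coefficients.
# betas: (J-1) x p matrix, one row per non-reference class j = 1..J-1;
# class J is the reference, whose coefficients are fixed at zero.
multinom_probs <- function(x, betas) {
  eta <- as.vector(betas %*% x)      # x^T beta_j for each class
  denom <- 1 + sum(exp(eta))
  c(exp(eta) / denom, 1 / denom)     # (p_1, ..., p_{J-1}, p_J)
}

# Hypothetical coefficients for J = 3 classes and x = (1, x1)
betas <- rbind(c(-0.5, 1.0),
               c(-1.0, 0.5))
probs <- multinom_probs(c(1, 0), betas)
sum(probs)   # probabilities sum to 1
```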

Fitting a nominal logistic regression in R

In this example we look at data from subjects who were interviewed about the importance of various features when buying a car. The data set is available in the dobson R package (also available on Moodle):

  • sex: woman/man
  • age: 18-23, 24-40, >40
  • response: no/little, important, very important
  • frequency: number of interviewed people on each group

Fitting a nominal logistic regression in R

We can fit the following nominal logistic regression model using the multinom() function from library(nnet):

\[ \mathrm{log}\left(\frac{p_j}{p_1} \right) = \beta_{0j}+\beta_{1j}x_1+\beta_{2j}x_2+\beta_{3j}x_3, ~~ j=2,3 \]

where

  • \(j=1\) for “no/little importance” (the reference category)
  • \(j=2\) for “important”
  • \(j=3\) for “very important”
  • \(x_1=1\) for women and 0 for men,
  • \(x_2=1\) for age 24-40 years and 0 otherwise
  • \(x_3=1\) for age \(>\) 40 years and 0 otherwise.
# weights:  15 (8 variable)
initial  value 329.583687 
iter  10 value 290.566455
final  value 290.351098 
converged
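The call that produces output like the above can be sketched as follows. The frequencies here are made up purely to make the chunk self-contained; the real counts come from the dobson package / Moodle file:

```r
library(nnet)

# Toy stand-in for the car-preferences data (made-up frequencies!)
cars <- expand.grid(sex = c("men", "women"),
                    age = c("18-23", "24-40", "> 40"),
                    response = c("no/little", "important", "very important"))
cars$frequency <- c(10, 8, 7, 6, 5, 4, 9, 7, 8,
                    9, 10, 12, 4, 3, 6, 8, 12, 15)

# "no/little" is the first factor level, so it is the reference category
car_model <- multinom(response ~ age + sex,
                      weights = frequency,
                      data = cars, trace = FALSE)

coef(car_model)   # one row of coefficients per non-reference response level
```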

Interpreting the output

Let's look at the model summaries and interpret the coefficients.

y.level         term         estimate  std.error  statistic  p.value  conf.low  conf.high
important       (Intercept)     -0.98       0.26      -3.82     0.00     -1.48      -0.48
important       age24-40         1.13       0.34       3.30     0.00      0.46       1.80
important       age> 40          1.59       0.40       3.94     0.00      0.80       2.38
important       sexwomen         0.39       0.30       1.29     0.20     -0.20       0.98
very important  (Intercept)     -1.85       0.33      -5.60     0.00     -2.50      -1.20
very important  age24-40         1.48       0.40       3.69     0.00      0.69       2.26
very important  age> 40          2.92       0.42       6.90     0.00      2.09       3.75
very important  sexwomen         0.81       0.32       2.53     0.01      0.18       1.44

Examples: Interpreting the output

  1. What are the log-odds that a person considers power steering and air conditioning important vs of no/little importance?
  2. What are the log-odds that a male in the 18-23 age range considers power steering and air conditioning important vs of no/little importance?
y.level    term         estimate  std.error  statistic  p.value  conf.low  conf.high
important  (Intercept)     -0.98       0.26      -3.82      0.0     -1.48      -0.48
important  age24-40         1.13       0.34       3.30      0.0      0.46       1.80
important  age> 40          1.59       0.40       3.94      0.0      0.80       2.38
important  sexwomen         0.39       0.30       1.29      0.2     -0.20       0.98
  1. \[ \mathrm{log}\left(\frac{p_2}{p_1} \right) = -0.98 +0.39 x_1+ 1.13 x_2+1.59x_3, \]

    • \(x_1=1\) for women and 0 for men

    • \(x_2=1\) for age 24-40 years and 0 otherwise

    • \(x_3=1\) for age \(>\) 40 years and 0 otherwise.

  2. \(\log\text{Odds}(\text{Feature = important|gender=Male,Age = 18-23}) = -0.98\)

Examples: Interpreting the output

  1. What are the odds that a male over 40 years old considers power steering and air conditioning important vs of no/little importance?

  2. What are the odds that a person over 40 years old considers power steering and air conditioning important vs of no/little importance, compared to a younger person aged 18-23?

y.level    term         estimate  std.error  statistic  p.value  conf.low  conf.high
important  (Intercept)     -0.98       0.26      -3.82      0.0     -1.48      -0.48
important  age24-40         1.13       0.34       3.30      0.0      0.46       1.80
important  age> 40          1.59       0.40       3.94      0.0      0.80       2.38
important  sexwomen         0.39       0.30       1.29      0.2     -0.20       0.98
  1. \(\text{Odds}(\text{Feature = important|Age > 40, Gender = male}) = \exp(-0.98 + 1.59) \approx 1.84\), since male is our reference category for gender.

  2. \(\dfrac{\text{Odds}(\text{Feature = important|Age > 40})}{\text{Odds}(\text{Feature = important|Age = 18-23})} = \dfrac{\exp(-0.98 + 1.59)}{\exp(-0.98)}= \exp(1.59) \approx 4.90\)

    • The odds of a person over 40 considering this feature important are \(\exp(1.59) \approx 4.90\) times the odds of a person in the 18-23 age category (the baseline for age).

Examples: Interpreting the output

  1. What are the odds that a female considers power steering and air conditioning important vs of no/little importance, compared to a male?

  2. What are the odds that a female in the 24-40 age category considers power steering and air conditioning important vs of no/little importance, compared to a younger female aged 18-23?

y.level    term         estimate  std.error  statistic  p.value  conf.low  conf.high
important  (Intercept)     -0.98       0.26      -3.82      0.0     -1.48      -0.48
important  age24-40         1.13       0.34       3.30      0.0      0.46       1.80
important  age> 40          1.59       0.40       3.94      0.0      0.80       2.38
important  sexwomen         0.39       0.30       1.29      0.2     -0.20       0.98
  1. \(\dfrac{\text{Odds}(\text{Feature = important|Gender = female})}{\text{Odds}(\text{Feature = important|Gender = male})} = \dfrac{\exp(-0.98+0.39)}{\exp(-0.98)}=\exp(0.39) \approx 1.48\)

  2. \(\dfrac{\text{Odds}(\text{Feature = important|Age = 24-40, Gender = female})}{\text{Odds}(\text{Feature = important|Age = 18-23, Gender = female})} = \dfrac{\exp(-0.98 + 1.13 + 0.39)}{\exp(-0.98 + 0.39)}= \exp(1.13) \approx 3.10\)
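As a final check, both odds ratios can be re-derived numerically from the rounded table estimates (the last digit may differ slightly from values computed with the unrounded coefficients):

```r
# Rounded coefficients from the "important" rows of the table
b0 <- -0.98       # intercept
b_women <- 0.39   # sexwomen
b_24_40 <- 1.13   # age24-40

# 1. Women vs men (important vs no/little importance)
exp((b0 + b_women) - b0)                          # exp(0.39), about 1.48

# 2. Age 24-40 vs 18-23, holding gender (women) fixed
exp((b0 + b_24_40 + b_women) - (b0 + b_women))    # exp(1.13), about 3.10
```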